1. INTRODUCTION 

In recent years, large data sets and the computing power offered by graphics processing units (GPUs) 
have motivated research into deep-learning algorithms, which have shown excellent performance in 
various computer vision tasks and achieved a decisive advantage over traditional methods. The fundamental concept 
of cloud computing is that user data are not recorded locally, but are placed in the data center of the internet. 
These data centers could be managed and maintained by the companies that provide cloud computing services. 
Users can access the stored data at any time through any internet-connected terminal equipment using the 
application programming interface (API) provided by cloud providers. The immense growth of social media, 
ecommerce traffic and various web services has significantly raised the need for computational services [1]. 
One heuristic optimization approach for action extraction is ant colony optimization (ACO), 
along with particle swarm optimization (PSO). PSO optimizes a problem by maintaining 
a population of particles and moving these particles through the search space. Both the ACO and PSO algorithms 
learn classification rules one at a time using a sequential covering strategy. The construction of machine learning 
models has been an active research topic. The main approaches for learning ensemble classifiers on high-dimensional 
datasets are boosting, bagging and stacking [2]. Feature extraction is an essential component of IoT-based 
authentication frameworks. Since different facial features have different orientations, textures, and intensities, it 
is difficult to find the essential features in large datasets. Feature selection algorithms can be classed into 
adaptive, statistical, and semi-supervised or supervised search models; that is, wrapper models and embedded 
models could be used to describe all feature selection techniques. The feature selection model is different from 
classifier learning, as it does not assume that the learning algorithm is biased. 

The degree of uncertainty varies according to the nature of the training 
data, its variability, dependency, and interdependence [3]. The convolutional neural network (CNN) is a deep 
learning algorithm designed to operate on two-dimensional image data. The architecture of a CNN is distinctive: 
its two main components are convolution and pooling layers. Although this seems simple, these elements can 
be arranged in countless ways. Hyper-parameter tuning is the main process that affects the prediction 
performance of a CNN model (including its architecture and parameters) [4], [5]. Traditionally, the 
network's performance is tested manually by changing the value of one parameter while the others are 
held fixed. This is computationally expensive, particularly when the data set is large and the available resources 
are limited. The success of machine learning methods for any prediction task depends on finding the best architecture 
and tailoring the hyper-parameters to the given problem so that an accurate result is produced. The limitations 
of earlier approaches included many redundant proposals, many false positives, and difficulty in capturing 
representative semantic information in complex contexts [6]. With the rapid 
development of deep learning, deep learning algorithms for object detection have outperformed 
traditional feature extraction algorithms by large margins. Several studies have been performed on the detection of 
actions in the ambient assisted living (AAL) environment [6]. Traditional approaches to object detection are generally 
based on hand-crafted features for locating objects in each image. Three major steps are often taken in 
these methods: proposal generation, feature extraction and classification. The feature vectors are usually encoded 
with low-level visual descriptors such as scale invariant feature transform (SIFT) [7], Haar [8], histogram of 
oriented gradient (HOG) [9] or speeded-up robust features (SURF) [10], which show a certain robustness to scale, 
illumination and rotation variance. During the classification phase, categorical labels are assigned to the regions 
covered. Methods for classification include support vector machine (SVM) and AdaBoost. Although traditional methods 
have shown good performance on many public benchmark datasets, they still have many restrictions in difficult 
conditions. First, many proposals are generated, a large share of them redundant, which results in many false 
positives during classification. Secondly, feature descriptors are hand-crafted from 
low-level visual cues, making it difficult to capture representative semantic 
information in complex conditions. Finally, each step of the detection pipeline is designed and optimized separately, 
so an optimal overall solution cannot be obtained for the entire system. Following the success of 
deep CNNs in image classification, object detection based on deep CNNs has also made significant 
progress. Here, wrappers take advantage of learning techniques that highlight the most attractive features. 
Supervised feature selection increases classification efficiency while simultaneously reducing computer 
processing time. In recent years, filter-based feature selection criteria have been devised, e.g. Fisher score, trace 
ratio, Relief, and correlation-based feature selection (CFS) [11]. Feature selection improves the raw patterns. Feature 
selection algorithms fall into three categories, wrapper, filter, and hybrid, depending on their approach. 
The wrapper technique evaluates features with a learning algorithm, whereas the filter technique uses a statistical 
measure to identify their significance. While the filter technique is quicker, the wrapper technique is more accurate. 
Combining exploration and exploitation in meta-heuristic techniques leads to effective results. Diverse 
methodological approaches have been suggested that mimic natural phenomena or processes, and the balance between 
exploration and exploitation governs the ability of such techniques to avoid local optima and reach globally 
optimal values [12]. 
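
As an illustration of a filter criterion of the kind mentioned above, the following minimal sketch computes the Fisher score of each feature, that is, the ratio of between-class scatter to within-class scatter. The function name `fisher_scores` and the toy data are assumptions made for this example and do not come from the cited works.

```python
from statistics import mean, pvariance


def fisher_scores(samples, labels):
    """Return one Fisher score per feature column (higher = more discriminative)."""
    classes = sorted(set(labels))
    scores = []
    for j in range(len(samples[0])):
        column = [row[j] for row in samples]
        overall_mean = mean(column)
        between = 0.0  # class-size-weighted scatter of the class means
        within = 0.0   # class-size-weighted variance inside each class
        for c in classes:
            values = [row[j] for row, y in zip(samples, labels) if y == c]
            between += len(values) * (mean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        scores.append(between / within if within > 0 else float("inf"))
    return scores


# Hypothetical toy data: feature 0 separates the two classes, feature 1 does not.
samples = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [3.0, 4.5], [3.2, 3.5], [2.8, 4.0]]
labels = [0, 0, 0, 1, 1, 1]
scores = fisher_scores(samples, labels)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

A filter method of this kind ranks features without training a classifier, which is why it is quicker than a wrapper.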


2. RELATED WORKS 

Feature extraction is a basic step that directly affects the outcome of recognition systems; a poor 
choice of descriptor can considerably degrade performance and precision. Finding the relevant descriptor 
relies on trial and error and is complicated by the large number of features in the dataset. One of the primary 
benefits of feature learning techniques over handcrafted extraction is the generalization of the feature space within 
the same visual domain. In hexagonal-volume local binary pattern (H-VLBP), the binary pattern histogram 
encodes local volumes [13]. Despite its simplicity, the number of distinct patterns generated in 
neighbourhood regions by VLBP may become overwhelming. The convolution architecture efficiently exploits 
the image structure through "pooling" and "weight-sharing" to reduce the search space of the network. Pooling and 
weight sharing help to achieve robustness across differences in scale and space. To address this issue, 
3D convolution networks were introduced. Traditional models focus on constructing efficient descriptors 
or characteristics and then classifying them based on feature matching. Here, feature selection measures or 
filters are used to recognize various key features in the anomaly detection process. Global features, including 
silhouette-based descriptors, edge-based features, optical-flow-based representations, and motion history images 
(MHI), are used in CNN models [14]. Occlusions, changing viewpoints, and noise often create problems for 
global features. Local features always use image patches separately, and these patches are then 
combined to create a space-time model, as in SURF and HOG [15]. Local descriptors handle noisy 
and partly occluded images more efficiently. CNNs have proven to be strong feature extraction models 
for still image recognition. Interest in one-stage methods has recently been renewed by the single shot 
detector (SSD) and you only look once (YOLO) [16]. These detectors are tailored for speed, but their precision 
trails that of two-stage methods. SSD places default boxes of different scales on multiple layers within a ConvNet 
and forces each layer to concentrate on predicting objects of a particular scale. MS-CNN applies 
deconvolution on multiple layers of a ConvNet to increase the feature map resolution before using the layers to 
learn region proposals and pool features, which improves detection precision. Most CNN models are 
built from convolution and pooling layers, regularly downsampling height and width while 
increasing the number of feature maps [17]. DenseNet is a densely connected network. It follows a logic 
similar to ResNet in passing information from one layer to another, but in DenseNet the feature maps of 
every layer are linked to the input of each subsequent layer within a dense building block. Pooling is 
the most commonly used nonlinear downsampling strategy for translation invariance [18]. This enables later 
layers to directly access the features of previous layers and allows reuse of network features. DenseNet's building 
block is the dense block. Each dense block has several overlapping layers. 

A dense block is followed by a transition layer, whose output is passed to the next dense block. The 
advantages of DenseNet are several: alleviating the vanishing-gradient problem, encouraging the 
propagation and reuse of features, and reducing parameters because redundant feature maps are not 
necessary. MobileNet is a lightweight architecture. It uses depthwise separable convolutions, which basically 
means that it does not merge and flatten all three channels but performs a single convolution on each channel. This 
permits filtering of the input channels. Depthwise separable convolution reduces network complexity and model 
size, which suits mobile or low-compute devices. The works in [19], [20] study biometrics using a 
smartphone gyroscope to identify users. Touches, face patterns and phone alignment on smartphones were 
used to present an unrestricted, implicit multimodal biometric system. Ninety-five subjects were chosen for the 
collection of various touch and phone movement patterns for a mobile multimodal data set. The results were 
shown to be accurate and to improve usability and security. Huang et al. [21] developed and proposed a 
multimodal biometric recognition system using two distinct biometric features, the veins of the back of the hand 
and the veins of the palm. Binarization was initially used for image preprocessing. The researchers then employed 
morphological dilation, together with an average filter for image smoothing, to remove smaller, 
unconnected objects. Filtering and thresholding were performed in two steps for the extraction of features. 
The researchers then used the model predictive control (MPC) method and K-means for the extraction and 
matching of the LBP and template processes, with MPC serving as the feature extraction method. For 
model matching with a relative side length of 1.6d, the equal error rate (EER) was 0.01965%, 
while it was 0.058% for the LBP matching mechanism with a relative side length of 1.5d. The CASIA 
v1.0 dataset was used for the research. The competitive hand valley detection (CHVD) process was 
employed to extract the ROI. After that, the features were extracted using one of three methods: LBP, 2D local 
binary pattern (2DLBP), or a combination of the two (LBP and 2DLBP). The researchers used principal 
component analysis (PCA) for feature reduction together with the linear discriminant 
analysis (LDA) approach, and worked with the principal component features. The CASIA palm 
print database was used in the experiments to determine the ED-like results. After applying 
the third method (LBP+2DLBP), they observed 98.55% precision. Although this new approach was applied to 
palm prints, the researchers felt it could deliver accurate results for vein patterns. Koley et al. 
[22] address the issue of user authentication with facial biometrics. They proposed an authentication neural network 
model based on a two-layer perceptron with 90 input neurons, 10 hidden neurons and 4 output neurons. 
It is noted that the specified network architecture parameters were determined experimentally. Input 
parameters include the geometry of local features: coordinate (X), coordinate (Y), and direction vector (Q). 
Thirty local features are commonly used. The neural network classifier is shown 
experimentally to have a type I error rate of 5.2% and a type II error rate of 
0%. On that basis, use of the constructed neural network model in conjunction with other biometric 
authentication technologies is argued for. Similar results are described in [23], which further characterizes the 
mathematical framework on which the operation of the two-layer perceptron is based. The use of neural networks in a 
facial identification system is covered in [24]. It is shown that difficulties in identification arise because the 
image generated by the scanner may differ slightly from scan to scan. To address these 
problems, a multi-layer perceptron with one hidden layer is proposed as the neural network model. Its 
simplicity and proven record explain the choice of neural network architecture. A facial image of 188x240 
pixels is the source of input information for the neural network model. Twelve geometrical moments, each 
corresponding to one of the input neurons, are calculated. The number of output neurons is six. Following 
Kolmogorov's theorem, the number of hidden neurons is 25. The sigmoid is used as the activation function. 
The network was trained using the conventional backpropagation algorithm. There were 100 print 
samples, and training was repeated 1,000 times. The claimed recognition accuracy on the training examples was 100%, 
indicating the promise of neural networks for facial recognition. According to [25], facial images 
from different individuals may be equal in global features, but it is impossible for them to be equal in 
local features (minutiae). Consequently, the process of identification usually comprises two steps. Initially, 
facial images are classified according to global criteria by dividing them into classes using databases. The 
second step is the identification of the face on the basis of a structural comparison and the coincidence 
factor of the detailed points. 

The proposed algorithms for facial image classification are based on the application of Gabor filters, 
Haar and Daubechies wavelet transforms, and multi-scale neural networks, according to the type of model. 
Numerical experiments are performed and the results are presented for the proposed algorithms. An algorithm 
based on the combined application of Gabor filters, a five-level Daubechies wavelet transform and a two-layer 
perceptron neural network is shown to achieve a classification precision of about 75%. Liang et 
al. [26] review neural-network-based methods for facial images, in which architectures such as the two-layer 
perceptron are used. The modulus and the argument of the gradient vector field of the image 
are used as input parameters of the neural network. The conclusion is drawn that the dimension of the neural 
network input vector needs to be increased to 400. Padol and Yadav [27] detail the local and global features of the biometric authentication 
systems for faces. It is demonstrated that the ability to distinguish the features that can later be used in the 
identification process depends largely on the quality of the facial image. Standard facial scanners are reported 
to provide a 500-dpi resolution, with the image characterized by 256 luminosity levels and a maximum 
vertical rotation angle of 15 degrees. The end points, where the papillary lines end 
"distinctly", and the branch points, where the papillary lines bifurcate, are proposed as characteristic 
features. Note that images with a resolution of approximately 1,000 dpi 
make it possible to identify in greater detail the internal structure of the papillary lines (sweat glands) 
using the properties of the finger surface, enabling a significant improvement in the accuracy of 
identification. However, the level of technical support currently available for common biometric authentication 
systems does not allow images of this type to be obtained. Reza et al. [28] consider the 
technology for the design and operation of a two-level facial recognition network system. Features are 
extracted at the first level, and the chosen features are analyzed at the second level, as a result 
of which a user is identified. The accuracy of selecting informative facial features, which in turn 
depends largely on the quality of the recognized image, has been demonstrated to affect the performance of the 
recognition system. This is why the system includes a module to improve the quality of the original image from 
the scanner. The key requirement of this module is that the minutiae of the facial structure should not be 
damaged. To that end, the authors propose Gabor filters, binarization of the gray print image, and skeletonization 
to a width of one pixel. Neural networks are used in the proposed 
system at both recognition levels. Each is a two-layer perceptron with a hidden layer of 200 neurons and a 
symmetric sigmoid transfer function. The two-layer perceptron has 5 output neurons at the 1st 
recognition level, and the same perceptron is used at the 2nd level. The empirical nature of the parameter 
determination is noted. The numerical results show 92% recognition accuracy. It is stated that the use of deep 
neural networks and the parallelization of the learning and recognition processes are subjects of further research. 
The effectiveness of facial classification based on fuzzy class probabilities is compared with that of modern 
neural networks in [29]. The main premise of the study is that deeper neural networks can be used for graphic 
image analysis. A multilayer perceptron with 3 hidden layers was used as the underlying model. 
Pre-training of the perceptron was implemented with a sparse autoencoder. The results show that changing the number of 
hidden neurons from 200 to 1,250 does not affect the recognition accuracy of around 93%. 
At the same time, classification using a fuzzy classifier achieved higher results of about 97%. The authors 
therefore conclude that neural network methods are ineffective for facial recognition. However, the article does 
not provide an experimental plan for accurately determining the structural parameters of neural network models. It 
also raises the question of the suitability of the pre-training process, which is critical when labeled training 
examples are scarce and the logistic function is used. In [30], a multimodal biometric system has been 
developed with face, iris and ear modalities. Issues of the traditional approaches: i) traditional binary 
classification approaches use static feature extraction measures for facial feature analysis, ii) traditional 
approaches require high computational time for a large number of candidate facial features, and iii) binary 
classification uses a limited feature space to predict facial features. 


3. PROPOSED MODEL 
3.1. Multi-level facial feature extraction framework 

Most traditional frameworks filter facial features with convolution 
kernels of 3x3x3. Here, the most essential data security features are found through different convolution layers, 
max pooling and filters. As shown in Figure 1, the fully connected layer with a softmax activation function 
is used to filter the essential features in the image. These features are then used to classify 
biometric characteristics with the proposed classification model. 
Figure 1 describes the overall framework of the proposed approach for multi-modal facial feature 
extraction and classification. In this approach, different facial features are taken as input to the feature 
extraction process. In this work, essential key features are extracted using facial key point extraction, 
scalable gradient-based feature extraction, curvature-based facial features, the log inverse 
differential moment, and max correlation measures. This set of feature extraction measures is used to 
extract different candidate sets of facial key features in the framework. The network has several hidden 
layers between the input and output layers that are fully connected. However, the main issue is the 
prediction of the key features because of their high dimensionality. To overcome this challenge, a neural 
network model based on the CNN architecture, which is used for larger applications in the field of image processing, 
is adopted. Rotation, translation or scaling nodes for group layers are employed in computer vision tasks to model an 
object in a different patch or dimension. With these connections, the network will develop although the input 
connection is static and the nearby units of each network device are heavily influenced. The proposed C3D 
network is used to locate the low-level features and filter out the key features of each image. In the 
framework, these feature extraction measures, along with convolution filters, are used to filter the essential key 
feature sets for the fully connected layer. The filtered features in the fully connected layer are passed to an 
ensemble classification approach for multi-level classification. In this work, a novel SVM model based on kernel 
function optimization is proposed to predict the multi-level facial features and to 
improve the true positive rate and error rate. 


3.2. Facial feature measures 

In facial feature extraction, essential key points are extracted using the proposed key point 
extraction method, and the user's facial expressions are evaluated using the same 
process. The existing dynamic chaotic map uses only two parameters, α and β. Moreover, the chaotic region 
is easily predictable because the weighted parameters are fixed and range from 0 to N. 


3.2.1. Facial key points extraction 

The steps for extracting facial key points are presented as follows: 
Step 1: read each frame in the video file. 

Step 2: initialize each frame to $V_i(x, y)$ for key feature extraction. 
Step 3: apply image normalization using (1)-(5), 
where $\gamma = (\mathrm{Max} - 0) = (255 - 0) = 255$, $u_1, u_2 \in (120, 150)$, and $v_1, v_2 \in (250, 255)$. 
Step 4: apply the facial pattern scaling filter using (5) as (6). 
Step 5: identify the different gradient features using the scaled image $S(x, y)$ as (7), 
where $\nabla$ is the gradient and $G$ is the Gaussian function. 
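
The flow of these steps can be sketched in plain Python. This is only an illustration under stated assumptions: min-max rescaling to [0, 255] stands in for the normalization, and the gradient of a Gaussian-smoothed frame stands in for the gradient feature step. The function names, the kernel radius and the toy frame are hypothetical and are not taken from the numbered equations of the paper.

```python
import math


def normalize_frame(frame, new_max=255.0):
    """Min-max rescale pixel values to [0, new_max] (assumed normalization)."""
    flat = [p for row in frame for p in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on constant frames
    return [[(p - lo) / span * new_max for p in row] for row in frame]


def gaussian_weights(sigma=1.0, radius=2):
    """Normalized 1-D Gaussian weights G for offsets -radius..radius."""
    raw = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def clamp(i, n):
    """Replicate border pixels by clamping an index into [0, n - 1]."""
    return max(0, min(n - 1, i))


def gaussian_smooth(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: one horizontal pass, then one vertical pass."""
    g = gaussian_weights(sigma, radius)
    h, w = len(image), len(image[0])
    horiz = [[sum(g[k + radius] * image[y][clamp(x + k, w)]
                  for k in range(-radius, radius + 1)) for x in range(w)]
             for y in range(h)]
    return [[sum(g[k + radius] * horiz[clamp(y + k, h)][x]
                 for k in range(-radius, radius + 1)) for x in range(w)]
            for y in range(h)]


def gradient_magnitude(image):
    """Finite-difference gradient magnitude at every pixel."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            gx = (image[y][clamp(x + 1, w)] - image[y][clamp(x - 1, w)]) / 2.0
            gy = (image[clamp(y + 1, h)][x] - image[clamp(y - 1, h)][x]) / 2.0
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


# Hypothetical 8x8 frame with a vertical step edge between columns 3 and 4.
frame = [[10] * 4 + [50] * 4 for _ in range(8)]
normalized = normalize_frame(frame)
edges = gradient_magnitude(gaussian_smooth(normalized))
```

In this toy frame the gradient response is concentrated around the step edge and is zero in the flat region on the left, which is the behaviour a gradient-based key point detector relies on.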
Figure 1. Facial feature extraction-based framework 


3.2.2. Facial eye and lip curvatures 
Let $a_x$ and $b_y$ be the major and minor axes of the facial image. 

$$\text{Iris Gabor filter}(p, q) = \exp\left(-\frac{p'^2 + \gamma^2 q'^2}{2\sigma^2}\right) \sin\left(\frac{2\pi p'}{\lambda} + \psi\right)$$

where 
$$p' = p\cos(\theta) + q\sin(\theta)$$
$$q' = q\cos(\theta) - p\sin(\theta)$$

where $\lambda$ is the wavelength of the sinusoidal function ($f = 1/\lambda$), $\theta$ is the Gabor filter orientation, 
$\psi$ is the phase offset of the Gabor filter ($\psi = 90°$ gives the real part and $\psi = 0$ the imaginary part), 
$\sigma$ is the bandwidth and $\gamma$ is the aspect ratio. 
Constraints to check the minimum and maximum curvilinear structures of the iris are given by the following 
expressions. Minimum curvature patterns of the image are computed as 
Maximum curvature patterns of the image are computed as (8): 


max ax by max{* we Sie (8) 


Different models are generated using the minimum and maximum curvature of the image in this feature 
extraction procedure. 
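
A Gabor kernel of the kind used in this subsection can be sketched as follows. The sketch assumes the textbook form, a Gaussian envelope multiplied by a sinusoidal carrier evaluated in rotated coordinates (p', q'), with a sine carrier so that a phase offset of 90° yields the even-symmetric (real) component. The function name `gabor_kernel` and all parameter values are hypothetical choices for illustration.

```python
import math


def gabor_kernel(size, wavelength, theta, psi, sigma, gamma):
    """Build a size x size Gabor kernel as a list of rows.

    wavelength: period of the sinusoidal carrier (lambda)
    theta:      filter orientation in radians
    psi:        phase offset in radians
    sigma:      width of the Gaussian envelope
    gamma:      spatial aspect ratio
    """
    half = size // 2
    kernel = []
    for q in range(-half, half + 1):
        row = []
        for p in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation.
            p_rot = p * math.cos(theta) + q * math.sin(theta)
            q_rot = q * math.cos(theta) - p * math.sin(theta)
            envelope = math.exp(-(p_rot ** 2 + gamma ** 2 * q_rot ** 2) / (2.0 * sigma ** 2))
            carrier = math.sin(2.0 * math.pi * p_rot / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


# Hypothetical parameters: horizontal orientation, 90 degree phase offset.
kernel = gabor_kernel(size=9, wavelength=8.0, theta=0.0, psi=math.pi / 2,
                      sigma=2.0, gamma=0.5)
```

Convolving a face or iris region with a bank of such kernels at several orientations gives the oriented responses from which curvature patterns can then be measured.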


3.2.3. Log inverse differential moment 

Log inverse differential moment (LIDM) is used to measure the homogeneity of the image structures. 
Because of the normalization factor $(1 + (m_1 - m_2)^2)^{-1}$, heterogeneous areas ($m_1 \neq m_2$) make only 
small contributions. Heterogeneous images therefore yield a low LIDM, while homogeneous images 
yield a higher LIDM, as evaluated using the equation. 
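
The effect of the weighting factor can be checked with a short sketch. It computes the plain inverse difference moment from a normalized grey-level co-occurrence matrix; how the logarithm enters the LIDM is not reproduced here, so the code covers only the weighted sum. The function names and the toy images are hypothetical.

```python
def cooccurrence(image, levels, dx=1, dy=0):
    """Normalized grey-level co-occurrence matrix for the offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(image), len(image[0])
    total = 0
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                counts[image[y][x]][image[ny][nx]] += 1
                total += 1
    return [[c / total for c in row] for row in counts]


def inverse_difference_moment(p):
    """Sum of P(m1, m2) / (1 + (m1 - m2)^2) over all grey-level pairs."""
    levels = len(p)
    return sum(p[m1][m2] / (1 + (m1 - m2) ** 2)
               for m1 in range(levels) for m2 in range(levels))


# Hypothetical 4-level images: a flat patch and a 0/3 checkerboard.
flat = [[2] * 6 for _ in range(6)]
checker = [[3 * ((x + y) % 2) for x in range(6)] for y in range(6)]
idm_flat = inverse_difference_moment(cooccurrence(flat, levels=4))
idm_checker = inverse_difference_moment(cooccurrence(checker, levels=4))
```

The flat patch reaches the maximum value of 1, whereas every neighbouring pair in the checkerboard differs by three grey levels, so each pair is weighted by 1/(1 + 9).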

3.2.4. Max correlation inertia 
Max correlation inertia (MCI) measures the maximal correlation, that is, the grey-level linear 
dependence between the pixels at the given positions. The maximum correlation and inertia measures describe 
the linear structure of an image as well as the distribution of its grey-scale values. 

3.3. Bayesian based non-linear SVM classification 

In the framework, different feature sets in the biometric images are classified using the hybrid non-linear 
SVM algorithm. The kernel values are modified at each point to filter the features of the input image. 
Here, a non-linear kernel function is optimized using the multi-modal facial features with multiple classes. 
Bayesian estimation in the SVM classifier improves the conditional estimation of each feature in the multi-modal 
feature space. The decision boundary of the proposed multi-class SVM is given as: 


$$p = \operatorname{sgn}\left(\sum_{i=1}^{n} \gamma_i \, \phi(K_i, y_i) + b\right)$$
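
For orientation, the sign-of-a-kernel-sum form of such a decision rule can be sketched with a standard binary kernel SVM. The sketch uses a fixed RBF kernel and hand-picked support vectors; it does not implement the Bayesian kernel optimization or the multi-class extension proposed in this work, and every name and value in it is a hypothetical choice for illustration.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Radial basis function kernel exp(-gamma * ||a - b||^2)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, labels, alphas, bias=0.0, kernel=rbf_kernel):
    """Return sgn(sum_i alpha_i * y_i * K(sv_i, x) + bias) as +1 or -1."""
    score = sum(alpha * y * kernel(sv, x)
                for sv, y, alpha in zip(support_vectors, labels, alphas)) + bias
    return 1 if score >= 0 else -1


# Hypothetical model with two support vectors, one per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
near_positive = svm_decision((1.9, 1.9), support_vectors, labels, alphas)
near_negative = svm_decision((0.1, 0.1), support_vectors, labels, alphas)
```

A multi-class classifier can be built from such binary rules, for example with a one-vs-rest scheme that keeps the class whose score is largest.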


4. EXPERIMENTAL RESULTS 

Experimental results are simulated in a real-time cloud computing environment. Results are obtained 
using a Python environment for the feature extraction and classification process. In this work, different IoT 
captured video frames are taken as input to the proposed model in order to filter the essential key patterns and 
facial features. In these experiments, different performance metrics such as number of key features, 
classification recall, precision, accuracy, F-measure, error rate, and runtime are computed and compared to 
the conventional models. Sample facial expression images from the dataset are used for training data 
preparation and for the landmark feature point detection process. 
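
The classification metrics listed above follow from the confusion counts of a binary decision. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not results from the experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard metrics from true/false positive/negative counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "error_rate": 1.0 - accuracy,
    }


# Hypothetical counts for one test video.
metrics = classification_metrics(tp=45, fp=5, fn=5, tn=45)
```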

Real-time face detection is performed on noisy frames. Here, the proposed feature extraction measures are used 
to find the key features in the real-time videos. As shown in the sample video frame, the human face is detected 
with high probability using the proposed feature selection measures for the classification problem. Figure 2 
compares the number of features extracted by the proposed multiple feature extraction measures with that of the 
conventional facial feature extraction measures in the framework when the threshold is 0.3. Figure 3 and Table 1 
give the same comparison for thresholds of 0.5 and 0.7, respectively. In each case, the threshold is used to filter 
the essential key features from the large facial feature space. 
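
The role of the threshold can be illustrated with a small sketch. It assumes that every candidate feature carries a relevance score in [0, 1] and that a feature is kept when its score reaches the threshold T; the scoring itself, the feature names and the direction of the comparison are assumptions made for this example.

```python
def filter_key_features(scores, threshold):
    """Keep the names of the features whose score is at least the threshold."""
    return [name for name, score in scores.items() if score >= threshold]


# Hypothetical relevance scores for a handful of candidate features.
scores = {
    "eye_curvature": 0.82,
    "lip_curvature": 0.74,
    "lidm": 0.55,
    "mci": 0.41,
    "gradient_mean": 0.28,
}
counts = {t: len(filter_key_features(scores, t)) for t in (0.3, 0.5, 0.7)}
```

Raising the threshold can only shrink the retained set, so the feature counts at T=0.3, 0.5 and 0.7 are non-increasing.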

Figure 2. Performance analysis of proposed multiple feature extraction measures to the traditional measures 
for essential key features filtering when threshold T=0.3 in the framework 

Table 2 compares the proposed multiple facial feature extraction-based classifier with 
the conventional models in terms of accuracy. In this table, the average accuracy value of the test videos 
is taken as the basis of comparison between the proposed and existing models. Table 3 gives the 
corresponding comparison for recall, where the average recall value of the test videos is taken as the basis of 
comparison between the proposed and existing models. 

Figure 3. Performance analysis of proposed multiple feature extraction measures to the traditional measures 
for essential key features filtering when threshold T=0.5 in the framework 


Table 1. Performance analysis of proposed multiple feature extraction measures to the traditional measures 
for essential key features filtering when threshold T=0.7 in the framework 
Video data PSO+BSVM+CNN PCA+RF+CNN MI+CNNSVM FSBNN Proposed model 
#1 50 56 65 72 43 
#2 55 54 51 77 46 
#3 31 59 56 80 45 
#4 58 61 51 75 38 
#5 49 62 50 77 35 
#6 31 64 55 73 43 
#7 38 63 51 83 38 
#8 32 52 58 76 38 
#9 46 53 62 71 36 
#10 39 55 59 73 40 
#11 50 55 65 74 35 
#12 46 60 54 74 45 
#13 63 51 60 71 39 
#14 57 62 57 75 41 
#15 40 62 63 77 35 
#16 44 54 52 A. 44 
#17 65 52 64 76 46 
#18 60 53 59 76 36 
#19 60 51 60 78 43 
#20 35 63 62 72 37 


Figure 4 compares the proposed multiple facial feature extraction-based classifier with 
the conventional models in terms of precision, using the average precision value of the test videos as the 
basis of comparison between the proposed and existing models. Table 4 gives the corresponding comparison 
for the F-measure, using the average F-measure value of the test videos. 

Figure 5 compares the proposed multiple facial feature extraction-based classifier with 
the conventional models in terms of the area under curve (AUC), using the average AUC value of 
the test videos. Table 5 illustrates the runtime (ms) of the proposed classifier relative to the conventional 
models, and Table 6 illustrates the error rate of the proposed classifier on different 
test facial features. From the table, it is noted that the proposed model achieves a lower error rate on 
the test data. 

Table 2. Performance of proposed multiple facial feature extraction-based classifier to the conventional 
models for accuracy measure 
Video data PSO+BSVM+CNN PCA+RF+CNN MI+CNNSVM FSBNN Proposed model 
#1 0.89 0.9 0.91 0.96 0.98 
#2 0.89 0.89 0.92 0.96 0.98 
#3 0.92 0.89 0.89 0.96 0.98 
#4 0.91 0.92 0.9 0.95 0.98 
#5 0.89 0.9 0.89 0.96 0.98 
#6 0.91 0.91 0.93 0.97 0.99 
#7 0.89 0.89 0.88 0.96 0.98 
#8 0.88 0.93 0.91 0.97 0.98 
#9 0.9 0.91 0.89 0.96 0.98 
#10 0.9 0.89 0.92 0.95 0.98 
#11 0.88 0.92 0.9 0.95 0.98 
#12 0.9 0.91 0.93 0.96 0.98 
#13 0.9 0.9 0.89 0.95 0.98 
#14 0.92 0.91 0.92 0.97 0.97 
#15 0.89 0.91 0.94 0.97 0.99 
#16 0.9 0.89 0.93 0.95 0.97 
#17 0.9 0.92 0.88 0.96 0.98 
#18 0.9 0.89 0.94 0.96 0.99 
#19 0.9 0.9 0.89 0.96 0.97 
#20 0.9 0.9 0.9 0.97 0.98 


Table 3. Performance of proposed multiple facial feature extraction-based classifier to the conventional 
models for recall measure 
Video data PSO+BSVM+CNN PCA+RF+CNN MI+CNNSVM FSBNN Proposed model 
#1 0.9 0.91 0.93 0.96 0.98 
#2 0.9 0.89 0.89 0.96 0.97 
#3 0.9 0.91 0.94 0.95 0.98 
#4 0.9 0.91 0.92 0.96 0.98 
#5 0.91 0.92 0.93 0.95 0.97 
#6 0.92 0.91 0.92 0.96 0.97 
#7 0.89 0.9 0.88 0.95 0.98 
#8 0.89 0.91 0.93 0.97 0.98 
#9 0.89 0.89 0.91 0.95 0.98 
#10 0.9 0.9 0.91 0.97 0.98 
#11 0.89 0.93 0.89 0.96 0.98 
#12 0.91 0.89 0.92 0.95 0.98 
#13 0.88 0.92 0.93 0.96 0.98 
#14 0.88 0.88 0.9 0.97 0.98 
#15 0.89 0.9 0.93 0.95 0.98 
#16 0.9 0.91 0.92 0.96 0.98 
#17 0.89 0.9 0.88 0.95 0.98 
#18 0.9 0.9 0.93 0.97 0.98 
#19 0.89 0.9 0.92 0.95 0.98 
#20 0.9 0.91 0.88 0.95 0.98 

[Plot of the conventional models and the proposed model over the test video data]

Figure 4. Performance of proposed multiple facial feature extraction-based classifier to the conventional 
models for precision measure 

Table 4. Performance of proposed multiple facial feature extraction-based classifier to the conventional 
models for F-measure 

Video data PSO+BSVM+CNN PCA+RF+CNN MI+CNNSVM FSBNN Proposed model 
#1 0.89 0.91 0.9 0.95 0.97 
#2 0.91 0.93 0.92 0.95 0.97 
#3 0.9 0.91 0.89 0.96 0.98 
#4 0.89 0.91 0.91 0.96 0.98 
#5 0.9 0.9 0.88 0.96 0.98 
#6 0.9 0.93 0.89 0.96 0.98 
#7 0.88 0.91 0.9 0.97 0.98 
#8 0.91 0.89 0.92 0.96 0.98 
#9 0.92 0.92 0.93 0.97 0.98 
#10 0.91 0.89 0.89 0.96 0.98 
#11 0.88 0.88 0.9 0.96 0.98 
#12 0.88 0.89 0.91 0.96 0.98 
#13 0.89 0.9 0.91 0.96 0.98 
#14 0.9 0.88 0.9 0.95 0.97 
#15 0.88 0.89 0.92 0.96 0.97 
#16 0.92 0.9 0.9 0.96 0.97 
#17 0.91 0.89 0.93 0.96 0.98 
#18 0.92 0.88 0.92 0.95 0.98 
#19 0.9 0.91 0.91 0.96 0.99 
#20 0.9 0.93 0.9 0.96 0.98 

[Plot of the AUC values of the conventional models and the proposed model]

Figure 5. Performance of proposed multiple facial feature extraction-based classifier to the conventional 
models for AUC measure 
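
How the AUC values in Figure 5 were computed is not stated in this section. As a minimal sketch, assuming AUC is taken as the area under the ROC curve, it can be computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The function name `roc_auc` and the labels and scores below are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative labels and classifier scores only; not experimental data.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(labels, scores))
```

The pairwise form is quadratic in the number of samples. It is used here for clarity and is not an efficient implementation.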


Table 5. Performance of proposed multiple facial feature extraction-based classifier to the conventional 
models for runtime (ms) computation 

Video data PSO+BSVM+CNN PCA+RF+CNN MI+CNNSVM FSBNN Proposed model 
#1 21371 15628 14262 3727 29420 
#2 12593 10512 12180 0986 27313 
#3 13549 15159 19308 7961 28118 
#4 11597 10412 12980 4252 26480 
#5 10297 10000 21110 2406 27668 
#6 16104 10660 12389 2655 26352 
#7 10856 12590 23287 7152 26121 
#8 20845 21257 12072 20528 26576 
#9 13673 21257 13509 9571 26616 
#10 22784 20585 12350 5084 26396 
#11 17215 16677 10627 9181 28166 
#12 17059 15573 21037 7840 28673 
#13 10502 20049 11688 6981 25314 
#14 19436 11415 19972 23709 26520 
#15 10005 12830 18455 22857 27066 
#16 15327 18831 15419 5934 27792 
#17 18514 22045 20755 21038 26331 
#18 11959 22853 16387 23680 25600 
#19 18775 19509 18247 20986 27172 
#20 13098 18927 16431 1453 27691 

Table 6. Performance of proposed multiple facial feature extraction-based classifier to the conventional 
models for error rate 

Video data PSO+BSVM+CNN PCA+RF+CNN MI+CNNSVM FSBNN Proposed model 


#1 0.1 0.1 0.08 0.04 0.02 
#2 0.06 0.08 0.08 0.03 0.01 
#3 0.09 0.08 0.08 0.05 0.02 
#4 0.12 0.11 0.09 0.03 0.01 
#5 0.06 0.1 0.09 0.03 0.01 
#6 0.1 0.08 0.09 0.04 0.02 
#7 0.08 0.1 0.08 0.04 0.01 
#8 0.07 0.1 0.09 0.03 0.01 
#9 0.09 0.09 0.09 0.05 0.01 
#10 0.08 0.1 0.1 0.04 0.02 
#11 0.1 0.09 0.09 0.05 0.01 
#12 0.08 0.09 0.09 0.03 0.02 
#13 0.12 0.09 0.1 0.03 0.02 
#14 0.1 0.1 0.09 0.04 0.01 
#15 0.09 0.1 0.1 0.04 0.01 
#16 0.1 0.09 0.1 0.04 0.02 
#17 0.06 0.08 0.09 0.04 0.02 
#18 0.08 0.09 0.09 0.03 0.02 
#19 0.06 0.1 0.08 0.04 0.01 
#20 0.07 0.09 0.08 0.03 0.01 


4.1. Result analysis 

In this work, a multi-level facial feature extraction-based ensemble classification framework is 
implemented on different facial expression datasets. As discussed in the experimental section, the proposed 
model has better accuracy, precision, recall, F-measure, AUC and runtime than traditional approaches 
such as PSO+BSVM+CNN, PCA+RF+CNN, MI+CNNSVM and FSBNN. The proposed model also has a better 
error rate (~10%) than the conventional models. 
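
The per-video values in the tables can be summarized per model. As a minimal sketch, the following averages the Table 6 error rates over the 20 test videos. The values are copied from Table 6, while the helper name `column_means` and the choice of the arithmetic mean as the summary are assumptions made for illustration.

```python
from statistics import mean

MODELS = ["PSO+BSVM+CNN", "PCA+RF+CNN", "MI+CNNSVM", "FSBNN", "Proposed model"]

# Error rates copied from Table 6, one row per test video (#1 to #20),
# columns in the same order as MODELS.
TABLE_6 = """
0.1 0.1 0.08 0.04 0.02
0.06 0.08 0.08 0.03 0.01
0.09 0.08 0.08 0.05 0.02
0.12 0.11 0.09 0.03 0.01
0.06 0.1 0.09 0.03 0.01
0.1 0.08 0.09 0.04 0.02
0.08 0.1 0.08 0.04 0.01
0.07 0.1 0.09 0.03 0.01
0.09 0.09 0.09 0.05 0.01
0.08 0.1 0.1 0.04 0.02
0.1 0.09 0.09 0.05 0.01
0.08 0.09 0.09 0.03 0.02
0.12 0.09 0.1 0.03 0.02
0.1 0.1 0.09 0.04 0.01
0.09 0.1 0.1 0.04 0.01
0.1 0.09 0.1 0.04 0.02
0.06 0.08 0.09 0.04 0.02
0.08 0.09 0.09 0.03 0.02
0.06 0.1 0.08 0.04 0.01
0.07 0.09 0.08 0.03 0.01
"""


def column_means(table_text, names):
    """Parse whitespace-separated rows and return the mean of each column by name."""
    rows = [[float(v) for v in line.split()] for line in table_text.strip().splitlines()]
    columns = zip(*rows)  # transpose rows into per-model columns
    return {name: mean(col) for name, col in zip(names, columns)}


mean_error = column_means(TABLE_6, MODELS)
best_model = min(mean_error, key=mean_error.get)  # model with the lowest mean error

for name, value in mean_error.items():
    print(f"{name}: {value:.4f}")
print("Lowest mean error rate:", best_model)
```

On the Table 6 values, this gives a mean error rate of about 0.0145 for the proposed model and 0.0375 for FSBNN, the closest conventional model.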


5. CONCLUSION 

In this paper, an efficient homogeneous facial feature extraction and classification framework is 
proposed to extract different features for the classification problem, since most traditional single-modal 
facial features offer a limited feature space for this problem. In this work, a hybrid classifier is 
used to classify the key facial points in the cloud computing environment. Experimental results show that the 
proposed hybrid multiple feature extraction-based framework performs better in terms of 
accuracy, error rate, recall, precision and AUC than the conventional models. In future work, this model will be 
extended to implement a novel feature extraction and segmentation-based multi-class classification framework 
on different multi-modal biometric features with a large feature space and data size. 